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Abstract. For the evaluation of information flow in bivariate time se- 
ries, information measures have been employed, such as the transfer 
entropy (TE), the symbolic transfer entropy (STE), defined similarly 
to TE but on the ranks of the components of the reconstructed vectors, 
and the transfer entropy on rank vectors (TERV), similar to STE but 
forming the ranks for the future samples of the response system with 
regard to the current reconstructed vector. Here we extend TERV for 
multivariate time series, and account for the presence of confounding 
variables, called partial transfer entropy on ranks (PTERV). We in- 
vestigate the asymptotic properties of PTERV, and also partial STE 
(PSTE), construct parametric significance tests under approximations 
with Gaussian and gamma null distributions, and show that the para- 
metric tests cannot achieve the power of the randomization test using 
time-shifted surrogates. Using simulations on known coupled dynam- 
ical systems and applying parametric and randomization significance 
tests, we show that PTERV performs better than PSTE but worse than 
the partial transfer entropy (PTE). However, PTERV, unlike PTE, is 
robust to the presence of drifts in the time series and it is also not 
affected by the level of detrending. 



1 Introduction 

A fundamental concept when studying coupled systems, and dynamical systems in 
general, is the dependence of one variable X measured over time on another variable 
Y measured synchronously, where both X and Y may be considered as variables of 
one single system or representative variables of coupled subsystems. The best known 
concept of interdependence in time series is Granger causality, defined in terms of pre- 
dictability of linear stochastic systems [H] , i.e. X Granger causes Y if it improves the 
prediction of Y when included in the autoregressive model. Granger causality has been 
quantified with measures based on linear models in the time and frequency domain 
[121511 1J . Besides the direct extension of Granger causality to nonlinear prediction 
models [30], the idea of Granger causality has been formulated focusing on specific 
system dynamics properties, ranging from point distance interdependence |4I10] to 
phase synchronization 137146] . Here, we consider the formulation of Granger causality 
in terms of information flow from X to Y, which encompasses linear and nonlinear 
dynamic systems as well as stochastic processes (for a very recent controversial stand 
on Granger causality see [45]). 
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The most prominent quantification of information flow from X to Y is by the 
conditional mutual information (CMI), expressing the amount of information of the 
future state of Y explained by the current state of X accounting for the current state 
of Y. This was first proposed by Schreiber |41j under the name transfer entropy (TE) 
using standard delay embedding of the current states, and was further modified for 
non-uniform embedding [50139] . CMI and subsequently mutual information (MI) and 
entropy bear different expressions for discrete and continuous variables, the former 
being simpler to estimate. Conversion from continuous to discrete variables by bin- 
ning techniques is not recommended when the variables are high-dimensional because 
the estimation of information measures bears the same problems of estimating high 
dimensional density with a binning method [33]. A different and more efficient dis- 
cretization can be obtained by using the rank order of the delay components of the 
high-dimensional variables obtained by delay embedding, an idea first formulated for 
the entropy, called permutation entropy [7152] . Similar approaches have been devel- 
oped for TE, such as the symbolic transfer entropy (STE) [33], which was further 
corrected to the so-called transfer entropy on rank vectors (TERV) [54j, and the 
momentary information transfer [35] (see also [B] for a combination of permutation 
entropy and the directional synchronization index). 

A first contribution of this work is to extend TERV to be able to quantify the direct 
information flow from X to Y in the presence of other interacting variables called 
confounding variables, stacked in the variable Z, termed partial TERV (PTERV). 
The extension is straightforward and is the same as for TE, the partial TE (PTE) 
[35152] . and for STE, the partial STE (PSTE) [33]. 

For discrete variables and processes, the equivalence of entropy, in the form of 
Kolmogorov- Sinai entropy, and permutation entropy has been established theoreti- 
cally under different conditions [211812111] . A recent study carries the equivalence to 
TE and TERV (as well as STE) but at the asymptotic rate [17] . These results pave the 
way to take asymptotic properties of estimates of entropy and MI of discrete variables 
over to the estimates of entropy and MI of ranked versions of continuous variables, 
the latter being the points obtained from time series by delay embedding. A second 
contribution of this work is to derive estimates for the bias and variance of rank-based 
coupling measures (TERV, STE as well as PTERV and PSTE), form parametric tests 
of significance of TERV and STE and compare them with randomization tests using 
time-shifted surrogates, a common approach for nonlinear coupling measures [36j . 

Comparative studies have found that TE, and subsequently PTE, perform con- 
sistently well, [26128130] . Therefore we compare PTERV to PSTE using parametric 
and randomization significance tests and also to PTE using the randomization test on 
simulated known coupled chaotic systems. We consider also the presence of stochastic 
trends in the time series, a situation met often in many applications. 

The structure of the paper is as follows. In Section[5] we present first the measures 
of information flow for bivariate and multivariate time series and then the partial 
transfer entropy on rank vectors (PTERV). We present the asymptotic statistical 
properties of PTERV and the parametric and randomization significance tests for 
PTERV in Section [3J Then in Section H the significance tests for PTERV and PSTE 
are compared on simulated systems and the two measures are further compared to 
PTE as well. Discussion of the resutls and concluding remarks are given in Section [5] 



2 Transfer Entropy and Rank Vectors 

We briefly present first the information measures for bivariate time series {xt,yt}tL±, 
i.e. transfer entropy (TE), symbolic transfer entropy (STE) and transfer entropy on 
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rank vectors (TERV). Assuming standard delay embedding with the same embed- 
ding dimension to and delay t, a favorable choice for investigating coupling |31j . the 
reconstructed points from the two time series are x t = [xt,Xt- T , ■ ■ ■ , x t _r m _u T ] and 
Yt = [ytiVt-Tj ■ ■ ■ Jt-fm-ijr]: respectively. We define the measures for the informa- 
tion flow, or Granger causality, from X to Y, denoted X — > Y, assuming the driving 
system being represented by X and the response system by Y. The future state of the 
response is defined in terms of T times ahead denoted yf = [yt+i, ■ ■ • , Ut+r], where 
often the future horizon is limited to T = 1 (yf = yt+i)- 

2.1 Transfer Entropy 

Transfer entropy (TE) is the conditional mutual information J(y^;x t |y t ) and quan- 
tifies the information about the future of the response system, yf, obtained by the 
current state of the driving system, x t , that is not already contained in the current 
state of the response system, y t . In terms of entropy, TE is defined as 

TE X ^ Y = I(yf ;x t |y t ) = -H (yf, x t , y t ) + H (x t , y t ) + H(yf, y t ) - H(y t ), (1) 

where H(x) = L /(x) In /(x)dx is the differential entropy of a continuous variable 
x with domain X, and /(x) is the probability density function of x. In estimating 
TEx_>y one can assume discretization of the observed variables Xt and yt and use 
the Shannon entropy H(~x) = ^p(x)lnp(x) for the discrete variable x, where the 
sum is over the possible bins of x and p(x) is the probability mass function (pmf ) 
of x. However, binning methods are found to be more demanding on data size than 
other methods approximating directly the density function, and subsequently the 
differential entropy. In particular, for high dimensions, i.e. large to, the fc-nearest 
neighbor estimate turns out to be the most robust to time series length |23l50j . and 
we apply this in the comparative study in Section @] 

The inefficiency of the binning methods for estimating entropies is attributed to 
the bias due to the estimation of bin probability with the relative frequency of occur- 
rence of entries in the bin, and the variance due to having a number of unpopulated 
or poorly populated bins. The latter increases with the embedding dimension to, and 
it is noted that for the discretization of x t and y t in b bins the variable of highest 
dimension [yf ,x t ,y t ] in eq.JT]) regards b 2m+T bins. 

2.2 Symbolic Transfer Entropy 

A different discretization that produces far less bins for the high dimensional variables 
is provided by the rank ordering of the components of vector variables. For each 
point y t , the ranks of its components, say in ascending order, form a rank vector 
y< = [n,i,r t ,2, r t , m ], where r t ,j G {1, 2, . . . , m} for j = 1, . . . ,m, is the rank order 
of the component yt-(j-i) T - F° r t wo equal components of yt the smallest rank is 
assigned to the component appearing first in y t . Substituting rank vectors to sample 
vectors in the expression for Shannon entropy gives the so-called permutation entropy 
.ff(x) = ^p(x) lnj»(x), where the sum is over to! possible permutations of the to 
components of x [7]. 

The same conversion has been suggested for TE. In [33], the arguments in the 
CMI of TEx->y in eq.([T]) are modified as follows: x 4 and yt are substituted by the 
respective rank vectors x t and yt, and the future response vector yf is replaced by 
the response rank vector at time t + T, yt+T- This conversion of TE is called symbolic 
transfer entropy (STE) defined as 

STEx-^y = I(yt +T ;xt|yt) = -H(y t+T ,± t ,y t ) + H(x t ,y t ) + H(y t+T ,y t ) ~ H{y t ). 

(2) 
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2.3 Transfer Entropy on Rank Vectors 

STE is not the direct analogue to TE in terms of ranks. While the correspondence of 
y* to yt (and the same for x t and x t ) is direct and preserves the vector dimension, the 
vector yj = [y t +i, Vt+r] of dimension T is mapped to y t+T = [r t +T,u- ■ • , r t +T, m } 
of dimension m, the rank vector of yt+T — [yt+T, ■ ■ ■ ; Vt+T-fm-lw]- This indirect 
correspondence has implications in the computation of the entropy terms and thus 
the CMI. In particular, supposing r = 1, the joint vector [yt+TiYt] present in two 
entropy terms of cq.(fT]) has m + T consecutive components, and the possible rank 
orders of the corresponding rank vector are (m + T)\. Using the approach in STE, 
[y*+T,y*] is mapped to [yt+T,Yt] having ml ■ ( m "yy possible rank orders (for details 
see 

In [23], a correction of STE termed transfer entropy on rank vectors (TERV) is 
proposed. TERV assigns for the future response sample vector yf = [yt+i, • ■ • , Vt+r] 
in TE the future response rank vector yj = \rt,m+i, ■ ■ ■ , ft,m+T\ containing the ranks 
of [yt+i, ■ ■ ■ , yt+T] hi the augmented vector [y t , y^]. TERV is thus dchned as 

TERVx-j-y = I(yf; x t |y t ) = -H($f, x t , y t ) + H(x t ,y t ) + H(yf, y t ) - H(y t ). (3) 

We note that there is one-to-one mapping between the ranks of yt and the ranks 
of the first m components of the augmented vector [yt,y^] Q Thus the pmf for the 
first m components of the rank vector of [yt , y^] is the same as the pmf for the 
ranks of yt, defined on ml possible rank vectors. For the rest T components of the 
rank vector of [yt,y^], there are Ill=i( m + *) possible combinations, giving a total 
number of (m + T)l combinations for the rank vector of [yt,yf]- This is larger than 
the number of combinations STE assigns for [yt,yf], (m + T)l > ml ■ („" T )i , and 

thus the entropy terms H(yf ,x t ,yt) and H(yf,y t ) for the continuous variables in 
eq.([T]) for TE are correctly estimated by TERV but underestimated by STE, having 
H(y t+T ,± t ,y t ) < H(ff,± u ft) and H(y t+T ,y t ) < H{yJ \y t ) fJ3]. 

2.4 Partial Transfer Entropy on Rank Vectors (PTERV) 

Now we suppose we have K multivariate time series, denoted {xt, yt, Zi,t, ■ ■ ■ , ZK-2,t}t= 
in order to preserve the notation for the relation of interest X — > Y in the presence 
of K — 2 other observed variables Z±, . . . , Zjc-2- For direct Granger causality or di- 
rect information flow from X to V, one has to account for the confounding variables 
Z = {Zi, . . . , Zk-%}- The delay embedding for each Zj, i = 1,...,K — 2, gives 
Zj,t = [zi.t-, Zij-r, ■ ■ ■ j t_( m _i) r ], and for convenience we stack all the reconstructed 
points z^t to one z t = [zi. t , . . . , ^K-2,t\ of dimension (K — 2)m. 

TE has been extended to include the effect of the current state of Z on the future 
of the response Y and the current state of X, simply by adding it to the conditioning 
term of CMI. The so-called partial transfer entropy (PTE) is defined as [48132) 

PTE x ^ m = /(y t T ;x t |yt,Zt) (] 
= -H(y?,Xt,yt,Zt)+H(xL t ,y t ,z t ) + H(y?,y t ,z t )-H(yt,z t ). 1 ' 

The same formulation can be considered for STE and the partial symbolic transfer 
entropy (PSTE) is defined as [53] 

PSTE x ^ Y \z = -f(yt+T;x t |yt,zt) 

= -#(yt+T,x t ,yt,z t ) +if(x t) y t ,z t ) + H(y t+T , y t , z t ) ~ H(y t ,z t ). 

(5) 



For both there is a bijection to the same rank sequence as defined in [2117] . 
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Here we extend also TERV to the partial transfer entropy on rank vectors (PTERV) 
defined analogously as 

PTERV^ r | Z = /(y t T ;x t |y t ,z t ) 

= ~H(yf ,x t ,y 4 ,z t ) + H(x t ,y t ,z t ) + H(yf,f ti z t ) - H(y t ,z t ). 

(6) 

The dimension of the variables in the entropy terms for all expressions (PTE, PSTE 
and PTERV) increase with the number of variables K and the embedding dimension 
m. For PSTE and PTERV the possible rank permutations of z t = [zi jt , . . . ,ZK-2.t] 
are (m\) K ~ 2 , which shows that the demand for data size increases sharply with both 
K and m. 



3 Statistical Significance of PTERV 

A main disadvantage of nonlinear measures of Granger causality as opposed to linear 
causality measures is the lack of established asymptotic properties. However, deal- 
ing with discretized variables, as for TERV and PTERV (and respectively STE and 
PSTE), it is possible to do so, based on the estimation of Shannon entropy. There have 
been a number of works on estimates of Shannon entropy and their statistical proper- 
ties, and subsequently for mutual information 1271161341381311512914211311919125145] . 



3.1 Bias and variance in the estimation of entropy 

Here we follow the approximation for the bias and variance of Shannon entropy in 
|27j and [38] . and further extend it including higher order terms of the Taylor series 
expansion. 

Let us suppose that for the discrete variable X there are B states. For our setting 
X can be, say, x t , and then B — ml. Let the true probability for each state i = 
1, . . . , B, be pi, the observed frequency of state i be rii, and the estimated probability 
of state i be qi = rii/N , where N is the data size. The relative error in the estimation 
of the probability of state i is e; = (qi — pi)/pi. The following approach is based on 
the assumption that rii is a binomial random variable, rii ~ M(N,pi). 

The observed Shannon entropy is 

B B 

H obs = - E 9i m 9; = -'Yl/Pii 1 + £ i) MPii 1 + e 0) 

i=l i=l 
B B 

= -/,Pi In Pi - X] (P^ 1 lnPi + P^ 1 + e ') In ( 1 + £ «)) ' 



where the first term in the last expression is the true entropy H^. Using the Taylor 
expansion up to third order of ln(l + e»), and taking expectation we arrive at the 
expression for the bias 

B(tfobs) = (tfobs) - = "E (f (e?) - f(e?>) • (7) 



Making use of the first three moments of the binomial distribution, we have (ef) = 
(1 -pi)/(Npi) and (ef) = (1 - 3pi + 2pf)/(N 2 pf), and substituting them in eq.© we 
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get 

^* 3i?* 2 1 ^ 1 

B(^obs) = -^ ^^ + ^E-' ( 8 ) 

?:=i ' 

where B* is the number of states with positive observed frequency. The first term in 
the rhs of eq.© is the expression added to i? obs to give the bias-corrected entropy 
estimate of Miller-Madow [27] , which we expand here to third order Taylor approxi- 
mation. The information from the observed variable enters in the bias approximation 
of Miller-Madow only in terms of the number of observed states B*, but in eq.© 
also the observed frequencies are considered. However, the effect of the two addi- 
tional terms is downweighted by the data size N, and the bias in eq.© converges to 
— (B* - 1)/(2JV) with the increase of N. 

For the variance of the Shannon entropy estimate, we adopt the approximation 
given in |38) . where the error propagation formula 

Var(ff obs ) = f;(^) 2 Var(n 4 ) 

is invoked to arrive at the expression 

1 B 

Var(tf obs ) = - ^(Incfc + H ohs ) 2 qi (l - (9) 

i=l 

We note that for both bias and variance formulas in eq.© and eq.©, respectively, 
the independence of the binomial variables rii is assumed. 



3.2 Bias and variance of PTERV 

In order to derive the bias for PTERV, we first recall that the number of possible 
states for yj is B 4 = Y\J =1 (m + i), for x t is B 2 = ml, for y t is B 3 = ml, and 
for z t is B4 = (ml) K ~ 2 . These variables enter jointly in the entropy terms of the 
expression of PTERV in cq.®, and the possible states of the joint vector variables 
are as follows: B 1234 = B 1 B 2 B 3 B 4 for [yf ,k t ,y t ,z t ], B 234 = B 2 B 3 B 4 for [x t ,y t ,z t ], 
B i3 4 = B 4 B 3 B 4 for [yj,y t ,z t ], and B 34 = B 3 B 4 for [y t ,z t ]. The observed frequency 
and the estimated probability for [y^, x t , y t , z t ] are denoted as riijkt and qijki = 
riijki/{N — m — T), respectively, where the index i = 1, . . . , Si, denotes the state 
of yf, and respectively for the other indices. The observed frequency and estimated 
probability for the other three variables of smaller dimension are denoted in terms of 
the respective marginal distributions, e.g. n.±i and q.M for [y t ,z t ]. For each vector 
variable, the number of states with positive observed frequency is denoted with an 
asterisk in superscript, e.g. -B34 for the active states of [y t , z t ]. 

Substituting the bias expression in cq. ([8]) for each entropy term in the expression 
for PTERV in cq.©, we get the expression for the bias of PTERV 



B (^bs(yf;xt|yt:Z t )) = (I obs (yf;x t |y t ,z t )) -Ioo(y t T ;xt|yt,z t ) = 

^1234 ~~ -^234 ~ ^134 + ^34 ~ 4 3i? 1234 - 35^34 — 3-B 134 + 35^ 




(10) 



1 ri iM 
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For the variance of PTERV, we use the error propagation formula for Z obs (yf ; x t |y t , 
Var(Wyr;x t |y t ,z t) ) =i:E£i:( ^l i ^ l ^ ,i ' ) ) a V»('Hi*')- 

1=1 j=l k=l 1=1 x 

Following the same approximation as for the mutual information of scalar variables 
in |38j . we arrive at the expression for the variance 

h S* Bl B% 

Var(I obs (yf;x t |y t ,z t )) = j^—— - J2 (~ ln ^ M 

i=l j=l fe=l i=l 

+ lng. jfe; +lngj. fe( -lng.. fe( + -T obs (yf; x * |yt, z t )) 2 ?yfei (1 -qijkl)- (H) 
3.3 Parametric Significance Test for PTERV 

We consider now parametric tests for the significance of PTERV, i.e. for the null 
hypothesis H : /^(yf; x t |y t , z t ) = 0, indicating that the variables yj and x t are 
independent conditioned on the variable (y*, z t ). A natural choice for the asymptotic 
distribution of the estimate of mutual information is Gaussian, e.g. see [29119) . and 
employing the results for the bias in ca. lfTUj) and variance in ea. lfTTj, . we have the first 
candidate for the null distribution of the PTERV statistic 

Jobs(yf;*t|yi,Zt) ~ N (B(7 obs (yf ; x f |y t , z t )), Var(I obs (yf ; x t |y t , z t ))) • (12) 

The mutual information of two independent discrete variables X and Y is re- 
lated to the Chi-square statistic for the independence test with the expression \ 2 = 
2NI Q ^ S (X; Y), where the Chi-square distribution has (\X\ — 1)(|T| — 1) degrees of 
freedom (dof), \X\ is the number of states of X and N is the number of observa- 
tions of X and Y [34113] . Based on this result, Goebel et al [T3] proved that the 
statistic J obs (X;F) follows Gamma distribution. They extended their result to the 
conditional mutual information, and proved that I(X;Y\Z) when X and Y are in- 
dependent conditioned on Z follows also Gamma distribution with shape parameter 
« = |Z|(|X| — l)(|y| — l)/2 and scale parameter 9 = 1/N. This result adapted to our 
setting constitutes the second candidate for the null distribution of PTERV 

I oh8 (yJ ; x t |y t , z t ) ~ r ^(BJ - 1)(B* 2 1), {N _l_ T) ) ■ (13) 

If instead we trust the bias and variance approximation in cq. (fTU)) and eq.(fTTj). we 
can derive the shape and scale parameter of the Gamma null distribution from these, 
and then wc have the third candidate for the null distribution of PTERV 

WgbsiSfjMy^^f Var(/ obs (yf ;x t |y t ,z t )) \ 
Var(/ obs (yf;xt|y t) z t )' B(/ obs (y t T ; x t |y t , z t )) /' 

We note that the Gamma approximation of the null distribution of PTERV in cq. (fT4"]) . 
termed Gamma-2, converges to the Gaussian distribution in ea.(|12[l with the increase 
of bias, while the Gamma approximation in cq. (|13|) . termed Gamma-1, can differ in 
mean from Gamma-2 and Gaussian distributions. 

We show the differences of Gaussian, Gamma-1 and Gamma-2 distributions in 
approximating the true PTERV distribution with a simple example representing the 
null hypothesis of conditional independence. We generate the conditioning variable 



7 obs(yf;x t |y t ,z t ) 
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Fig. 1. The true distribution of PTERV and the three approximations, as given in the legend, 
formed from 1000 Monte Carlo realizations of the exemplary system of two independent 
variables conditioned on a third variable. The parameters are T = 1, N — 1024, m = 2 in 
(a) and m = 3 in (b). 



Z from a first order autoregressive process and let the other two variables depend 
linearly on Z 

x t = azt + e x ,t 

y t — bz t + cxt-i + e Vlt (15) 
z t = dzt-i + e z , t 

where a = 2, b = —1, d = 0.8, and e x j, e y j, e z .t are Gaussian white noise with 
standard deviation (SD) 1, 2 and 1, respectively. We set c = to have X and Y 
conditionally independent. In Fig. [TJ we show the approximate true distribution of 
PTERV (T = 1) from 1000 Monte Carlo realizations. We compute the average param- 
eter values from the 1000 realizations for the three distributions in eq.JTS]), ea. (fl"3)) 
and eq. (|14[) . and draw the three approximating distributions. Gamma-1 matches best 
the true distribution of PTERV for m = 2, but for m — 3 it lies at the right of the 
true distribution. For m = 3, Gaussian and Gamma-2 distributions converge but lie 
both far to the left of the true distribution. The deviation of all three approximations 
from the true distribution is larger when we further increase m or employ a more 
involved setting, e.g. using nonlinear dependence of X and Y on Z. 

First, we attempt to explain the shortcoming of Gamma-1 approximation for 
increasing to. Considering the Chi-squarc independence test in the setting of joint 
rank variables, we expect to have zero cells and incomplete contingency table, so that 
the degrees of freedom B^ 4 (Bl — 1)(B% — 1) of the Chi-square distribution associated 
with the Gamma-1 distribution in ca. (|13|) may be overestimated. In turn, this leads 
to ovcrcstimation of the bias B(7 ] 3S (y^; icj|yt, Zt)), because the mean of the null 
Gamma distribution is the product of the shape and scale parameter. To the best of 
our knowledge there is no standard approach to determine the degrees of freedom for 
the Chi-square statistic for incomplete contingency tables, especially when the zero 
cells depend on the correlation structure of the involved variables, e.g. see [22120] . 

Regarding the Gaussian and Gamma-2 approximations, we first note that the 
shape of the approximating distribution to the true null distribution of PTERV does 
not constitute an issue. The bias of PTERV is large (compared to the width of the 
distribution) even for small m and K and the distribution of PTERV becomes essen- 
tially Gaussian [33] . The estimation of the bias of PTERV, even when extending the 
Taylor approximation to third order terms (see ea. (|10[) ). is not accurate and deviates 
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at an extent that depends on the parameters involved in the definition of PTERV 
(m, K and T), and also on the inter-dependence of the components in (yt,z t ) and 
the dependence of yf and x t on (y t ,z t ). These dependencies cause violation of the 
assumption of independence of the binomial variables ny/y in the computation of 
the bias and the effect is larger when the number of states of the involved variables 
increase, i.e. either of m, K and T increases. 

3.4 Randomization Significance Test for PTERV 

The insufficiency of the analytic approximation of the true null distribution for the 
significance test for PTERV paves the way for employing resampling approaches. We 
consider here the randomization test making use of the time-shifted surrogates [36] . 
For PTERV, this is succeeded by circular shifting the components of the time series 
of the driving variable, {x t } = {x( m _ 1 j r+1 , X( m _ 1 ) r+2 , . . . , xat_t}, by a random time 
step w producing the surrogate time series 

{xj } = {x( m _i) T+u ,_|_i, . . . , xn-t, X( m _ 1 ) T+1 , . . . , X( m _ 1 ) T+ll ,}. 

The tuple {y^,x|, (y t ,z t )} represents Ho as the intrinsic dynamics of each variable 
are preserved but the coupling between the driving variable x* and the response 
variable yf is destroyed, so that it holds ^(yf; x£ |yt, zt) = 0. However, the time 
displacement of x t destroys also the possible dependence structure of x t on the condi- 
tioning variable (y t , z t ). This is a point of inconsistency that may affect the matching 
of the sample distribution of PTERV, computed on an ensemble of surrogate tu- 
ples {yf, x£ , (yt, z t )}, to the true null distribution of Ioo{yJ ; x t |y t , z t ). On the other 
hand, the construction of surrogate tuples {y^, x^ , (y t , z t )} that preserve all the orig- 
inal inter-dependencies and destroys only the coupling of y^" and xjf is an open, yet 
unsolved, problem. 

The randomization test is compared with the parametric tests on the basis of 
the example in cq. (|15|) . For one realization of the same system, the three paramet- 
ric null distributions and the null distribution formed by the PTERV values from 
1000 surrogates are shown in Fig. [5J For c = the variables X and Y are condi- 
tionally independent and Ho holds. For this case and m = 2, as shown in Fig. [2^, 
we obtain no rejection for any of the four one-sided tests, where the test statistic is 
J ^ s (y^; x t |y t , z t ) and the rejection regions are formed from each approximate null 
distribution. In this case the surrogate distribution is well matched by the Gamma- 
1 distribution. However, for m = 3, I \) S {yf' 7 *-t\yt, %t) lies well on the right of the 
Gaussian and Gamma-2 distributions giving false rejection, and on the left of the 
Gamma- 1 and surrogate distributions giving correctly no rejection (see Fig. HJd). The 
failure of the surrogate distribution to contain I i, s (yf', Xi|y*, Zt) is attributed to the 
inconsistency that the original X variable depends on Z but the surrogate X does 
not. 

When X drives Y (c = 1), all four approaches give correctly confident rejection of 
Ho for m = 2 (sec Fig.j^t), but for m = 3 the results are the same as for c = (Fig.[5Ji). 
Thus the inconsistency mentioned above affects the power of the surrogate data test, 
when there is dependence between the driving variable and the conditioning variable. 
When we remove this dependence, setting a = 0, the surrogate data test performs 
well for both m = 2 and m = 3 (see Fig. [2^ and f). Indeed for m = 3, the estimated 
probability of rejection of Ho at the significance level a = 0.05 from 1000 realizations 
is 99.8 for the test with Gamma-1 approximation, 68.6 for the randomization test and 
1.0 for Gaussian and Gamma-2 approximations. 

We argued in Sec. 12.41 that PSTE is suboptimal for its purpose and advocated 
for using PTERV instead. Having yt+T as response variable in PSTE (instead of 
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Fig. 2. The three parametric approximations of the null distribution of PTERV (T = 1) 
and the distribution formed by 1000 time-shifted surrogates, as given in the legend, for one 
realization of length N = 1024 from the system in eq. (|15|) with b = — 1, d = 0.8. The other 
parameters differ in the six panels as follows: (a) a = 2, c = 0, m — 2, (b) a = 2, c = 0, 
m = 3, (c) a = 2, c = 1, m = 2, (d) a = 2, c = 1, m = 3, (e) a = 0, c = 1, m = 2, (f) a = 0, 
c = 1, m — 3. The observed value of PTERV is shown by a vertical line. 
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yj used in PTERV) introduces stronger dependence of the response variable on the 
conditioning variable because yt+T and yt are formed from sample vectors that may 
have common components. In turn this produces smaller values of CMI and thus it is 
harder to identify significant causal effects. In particular, the power of the random- 
ization test with PSTE (quantified by the relative frequency of rejections of the null 
hypothesis of no coupling when there are true direct causal connections) is smaller 
even when there is no dependence between the driving variable X and the condi- 
tioning variable Z. For example, for the last setting of the system in eq. (|T5j) (a = 0, 
b = l,c=l,m = 3), using PSTE as test statistic there are no rejections for the 
randomization test and the test with Gamma- 1 approximation because PSTE lies 
always on the left tail of both null distributions. 

4 Simulations and Results 

We assess the ability of PTERV to identify the correct direct causal effects in multi- 
variate time series and compare it to PSTE and PTE (using the A:-nearest neighbor 
estimate and k = 5). For PTERV and PSTE we apply the parametric tests and the 
randomization test, and for PTE only the randomization test. To account for multi- 
ple testing, the correction of false discovery rate (FDR) is employed [5] • According to 
FDR, we order the p-values of the K(K — 1) tests, pi < P2 < ■■■ < Pk(K-i), and if 
Pk is the largest p-value for which pk < ak/(K(K — 1)) holds, then all H regarding 
Pi, . . . ,Pk are rejected at the significance level a. Thus the larger the K, the smaller 
the p- value has to be to give rejection. The implication of this for the randomization 
test is that the number M of surrogates has to be large because the smallest p-value 
that can be obtained from rank ordering is roughly 1/(M + 1)0 In the simulations 
we use K up to 5 and setting M = 100 we did not encounter any insufficiency of the 
randomization test due to the requirements of the FDR correction. 

4.1 Stationary time series 

We first consider the stationary multivariate time series from the system of K coupled 
Henon maps, defined as 

Xi tt = 1.4 - x\ t _ x + 0.3.t m _ 2 , i = 1, K 

x i>t = 1.4 - 0.5C(xi_ M _i + x i+1>t -i) + (1 - G)x\ t _ x + 0.3x M _ 2 , i = 2, . . . , K - 1 

where the parameter C determines the strength of coupling. The results for weak 
coupling (C = 0.2) and for K = 3 and K = 5 are shown in Table [TJ The results 
for Gamma-2 test are very similar to those of Gaussian test and are not shown. 
For all non-existing or indirect couplings, all measures and with all tests give none 
or only few rejections using FDR at the level of significance a = 0.05 (5 out of 
100 realizations), except the Gaussian approximation that gives always significant 
rejections for K = 5. The randomization test with PTE detects best the true direct 
causal effects, followed closely by the randomization test with PTERV, while using 
PSTE the power of the randomization test is decreased for K = 5 and even nullified 
for K = 3. The parametric test with Gamma-1 approximation has smaller power than 
the randomization test using PTERV and the opposite is observed using PSTE (the 
latter only for K = 5). 

2 Using the correction for the empirical cumulative function in [51] . we compute the p- 
value for the one-sided test as 1 - (r° - 0.326)/(M + 1 + 0.348), where r° is the rank of the 
original measure value in the ordered list of M + 1 values. 
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Table 1. The number of rejections (using the FDR correction) of the parametric tests, 
denoted as Gaussian and Gamma- 1, with PTERV and PSTE and the randomization tests, 
denoted as Surrogate, with PTERV, PSTE and PTE for 100 realizations of the system of 
K = 3 and K = 5 coupled Henon maps with C = 0.2, = 1024. The other parameters are 
m = 2, t = 1 and T = 1. 



The extension of the simulations to different coupling strengths C reveals the 
superiority of PTE and insufficiency of PSTE in detecting direct causal effects when 
the coupling is weak. In Fig. EJ PTERV, PSTE and PTE as well as the number 
of rejections in 100 realizations of the randomization tests are drawn for a range 
of C. Though the three measures are drawn at the same scale, their range is not 
comparable because PTE is estimated by nearest neighbors on the continuous- valued 
time series and its bias has different range than the bias of PTERV and PSTE, and 
the bias of PTERV and PSTE also differ due to different number of states of yf and 
yt+T, respectively. However, for the true coupling in Fig.|3^, one can note that PTE 
tends to increase steeper than PTERV, which in turn increases steeper than PSTE. 
Accordingly, PTE and PTERV are likely to be found significant for as small C as 
0.1, while PSTE gets significant only for strong coupling (C > 0.4). The lower level 
of PTE is negative, so that zero PTE is actually found significant (see for C = 0.1 
in Fig. which is counterintuitive. There are even some few rejections of Ho of 
no coupling when PTE is negative, as shown for large C for the erroneous causality 
X2 — > X\ in Fig. [3Jd. For the latter setting, PTERV is found statistically significant 
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Fig. 3. PTERV, PSTE and PTE measures (denoted collectively q) are given as functions of 
the coupling strength C for the true direct causality X\ —¥ X$ in (a) and the non-existing 
coupling X2 — > Xi in (b) for the coupled Henon maps of 3 variables (m = 2, r = 1, T = 1). 
The number of rejections in 100 realizations of the randomization test determines the size 
of a symbol displayed for each measure and C, where in the legend the size of the symbols 
regards 100 rejections. 



showing the tendency to detect coupling in the wrong direction when the coupling 
strength is large. 

For the simulation setting above, it was found that the Gaussian and Gamma-2 
parametric tests have decreased power compared to the Gamma- 1 parametric test and 
the randomization test. Further simulations showed that this holds only for m = 2 
and for larger m the bias is underestimated giving rejections for the Gaussian and 
Gamma-2 parametric tests, and on the other hand the Gamma-1 parametric test tends 
to be overly conservative. For example for K = 3, C = 0.2 and m = 3, Gaussian and 
Gamma-2 parametric tests give always rejection and Gamma-1 parametric test gives 
always no rejection for any pair of variables. On the other hand, the randomization 
test gives optimal results for both PTERV and PSTE, i.e. 100 rejections when there 
is true direct causality and no rejections when there is not, and the same is obtained 
with PTE. We note also that PTERV and PSTE converge in their performance as 
m gets larger because then the effect of defining the response rank vector differently 
weakens. 

We assess the measures on a system of three coupled Lorcnz subsystems 

ii = — 10xi + 10yi ii = — lOxi + lOjji + C(xi_i — Xi) 

Vi = -X1Z1 + 28.x 1 - yi iji = -XiZi + 28xi - jji i = 2,3 (16) 

Zi = xiyi - 8/3zi ii = x i y. l - 8/3zi 

The system is solved using the explicit Rungc-Kutta (4,5) method implemented 
in the solver ode45 in Matlab and the time series are generated at a sampling time 
of 0.01 time units. We observe the first variable of each of the three systems, i.e. the 
tuple of observed variables (X, Y, Z) is assigned to {x\, 22, £3), so the direct couplings 
are X — > Y and Y — > Z. We use the same coupling strength C for both couplings. 
For the flow system, longer time series are required to detect the information flow 
and in Table [2] we show the results for N = 4096 and N ~ 16384 for weak coupling 
(C = 2). Again the parametric tests fail for both PTERV and PSTE. For N = 4096, 
the randomization tests with PTERV and PSTE have good significance (we use again 
q = 0.05 and FDR correction) but almost no power, while the randomization test 
with PTE has optimal power and significance. It seems that the rank measures are 
more data demanding and as we increased ./V we could improve the power of their 
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Table 2. The number of rejections (using the FDR correction) of the Gamma- 1 parametric 
test with PTERV and PSTE and the randomization (surrogate) test with PTERV, PSTE 
and PTE for 100 realizations of the K = 3 coupled Lorenz systems with C = 2, for N = 4096 
and N = 16384. The other parameters are m = 4, r = 1 and T = 1. 



randomization test, as shown in Table [5] for N = 16384. However, both PTERV and 
PSTE tend to falsely detect the coupling Z — > X, while PTE detects also falsely 
and with higher estimated probability the indirect coupling X — > Z. Note that for 
this setting having a larger embedding dimension (m = 4) PSTE performs as well as 
PTERV. 

Further simulations for different coupling strengths reveal shortcomings of all mea- 
sures. As shown in Fig. H for N = 4096, PTERV and PSTE increase with C in the 
same way as PTE for the two true direct couplings. However, the power of the ran- 
domization tests for the two rank measures is much smaller than for PTE for the first 
true direct coupling (X —5- Y in Fig. 0^.) and at the zero level for the second true 
direct coupling (Y -> Z in Fig. 0b). On the other hand, both PTERV and PSTE are 
at the zero level when there is indirect or no coupling and the randomization test 
has the correct significance, unlike the randomization test with PTE that tends to 
reject with larger probability than the significance level, which increases with C. The 
latter is more pronounced for the indirect coupling (X — > Z in Fig. @Ji), while for 
no coupling, significant rejections are observed for large C (as for Y — > X shown in 

Fig. EH). 

Other simulations not shown here suggest that it is difficult to estimate the correct 
causality structure in this system and the results are sensitive to the choice of the 
parameters m, r and T. We note that when we added observational noise to the time 
series of the Henon and Lorenz coupled system we observed that the power of the 
randomization tests were reduced accordingly and in the same way for all causality 
measures. 



4.2 Non-stationary time series 

The simulation results showed that PTERV cannot reach the performance of PTE 
in detecting correctly the direct causal effects. However, there is a practical situation 
that PTERV can indeed be more useful than PTE, namely in the presence of drifts 
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Fig. 4. As Fig.0but for the three coupled Lorenz systems, m = 4, r = 1, T = 1, 4096, and 
the couplings X -> Y in (a), F — )• /? in (b), X -> Z in (c), and Y — > X in (d). 



in the time series, a common data condition in many applications. PTE along with 
many other measures of Granger causality assume stationarity of the time series. On 
the other hand, the measures applied to the ranks rather than the samples are little 
affected by non-stationarity in the mean. This is so because in the estimation of the 
effect of the driving variable on the response variable the level of their magnitudes is 
disregarded and only the relative magnitude ordering counts. 

We demonstrate the appropriateness of PTERV as compared to PTE on synthet- 
ically constructed non-stationary time series. We assume the coupled Henon maps in 
three variables and add to each time series a stochastic trend that is computed as 
follows. First, a Gaussian random walk time series of the same length as the original 
time series is generated, where the SD of the random steps is the same as the SD of 
the Henon time series. Then a moving average smoothing of order 100 is applied to 
the random walk time scries. This time series of smoothed stochastic trend is added 
to the time series of the first variable, X\, of the coupled Henon maps, and the same 
process is repeated for X^ and X3. An example is shown in Fig.[5j There are no stan- 
dard methods to remove stochastic trends and depending on the employed approach 
different time series may be derived and the original dynamics may be altered |47j . 
For example, the fit of a polynomial of order 15 seems to match well the slow drift 
(Fig. [5^) , but the derived time series after detrending, seems to have variations not 
expected in the original time series generated by the coupled Henon map (Fig. Eb)- 

We compute the measures PTERV, PSTE and PTE and conduct the Gamma- 
1 parametric test and randomization test on 100 non-stationary multivariate time 
series from the coupled Henon maps with C = 0.2, and also on the detrended time 
series subtracting the polynomial fit (see Table [3]). For the non-stationary time series, 
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Fig. 5. (a) A time series of the variable X\ of the coupled Henon maps (K = 3) after adding 
to it a stochastic trend. Superimposed is the fit of a polynomial of degree 15. (b) The time 
series in (a) detrended by subtracting the polynomial fit. 
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Table 3. The number of rejections (using the FDR correction) of the Gamma-1 parametric 
test with PTERV and PSTE and the randomization (surrogate) tests with PTERV, PSTE 
and PTE for 100 realizations of K = 3 coupled Henon maps with C = 0.2, N = 1024. The 
other parameters are m = 2, t = 1 and T = 1. The upper part is for the non-stationary time 
series (after a smoothed slow drift is added) and the lower part for the time series obtained 
after detrending with a polynomial of order 15. 



as expected, PTE fails to detect any coupling. On the other hand, PTERV and 
PSTE perform similarly to the case of no drifts with somehow smaller power of 
the randomization and Gamma-1 tests (compare Table [3] to Table [1]). Again PSTE 
has no power to detect the true direct causal effects. When PTE is applied to the 
detrended time series, it regains detection of the true direct causal effects, but it 
falsely estimates also the opposite causal effects with an estimated probability close 
to 0.2. On the other hand, the parametric and randomization test with PTERV gains 
almost the same power as if there was no drift to be detrended, while the tests with 
PSTE show again no power. 
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Fig. 6. (a) A time series of the variable X\ of the coupled Henon maps (K — 3) after adding 
to it a stochastic trend and superimposed is the moving average smoothing of different 
orders P, as shown in the legend, (b) and (c) PTERV, PSTE and PTE measures (denoted 
collectively q) as functions of the order P of the moving average smoothing used to detrend 
each time series from the coupled Henon maps of 3 variables (N = 1024, C = 0.2, m = 2, 
T = 1, T = 1). The number of rejections in 100 realizations of the randomization test 
determines the size of a symbol displayed for each measure and P, where in the legend the 
size of the symbols regards 100 rejections. The panel in (b) is for the coupling Xi — ► Xi and 
in (c) for X 2 -> X-y. 



The type of detrending turns out to have very little effect on the rank measures 
but it has on PTE. For the same example, and a weaker stochastic trend with the 
SD of the random walk increments being 0.6 times the SD of the coupled Henon time 
series, we apply smoothing with moving average of different orders P. As shown in 
Fig. HJi, a small P (but not as small to wash out the true dynamics, say one or two) 
ovcrfits the stochastic trend while a large P leaves some drift in the time series. A 
best P is hard to find, and even when using the same P as for the generation of the 
stochastic trend (P = 100), some small drift can be seen to remain, knowing that the 
true time series does not contain any slow fluctuations. We computed PTERV, PSTE 
and PTE and made randomization tests on 100 realizations for a range of orders P 
of moving average detrending, and the results for the true direct coupling X\ —> X2 
and the non-existing coupling X2 — > X\ are shown in Fig. [BJd and c, respectively. 
PTERV has striking robustness to the level of detrending and both its value and the 
percentage of rejections for the randomization test are at the same level for any P, 
including P = that regards no detrending. PSTE is equal stable but fails to detect 
the true direct coupling as was the case with stationary time series. PTE is affected 
by P and becomes very irregular. For the true coupling X\ — > X2, PTE increases 
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with P for small P up to a peak and then decreases with P. Accordingly, the power 
of the randomization test decreases down to the zero level for P = 250 reaching the 
same performance as for the generated non-stationary time series. Remarkably, for 
the non-existing coupling X2 —> X\ , the pattern of the dependence of PTE on P is the 
opposite, and for the range of P the true coupling was best detected, a larger number 
of rejections than the nominal significance level is observed for the randomization test. 
These results show that PTERV, unlike PTE, can directly be applied to multivariate 
time series with trends and avoid thus dctrending that may alter the dynamics and 
inter-dependencies. 



5 Discussion 

The rank measures of causality of partial transfer entropy on rank vectors (PTERV) 
and partial symbolic transfer entropy (PSTE) extend the bivariate causality mea- 
sures TERV and STE, respectively, to account for the presence of other observed 
confounding variables. In the same way that TERV was developed to modify the fu- 
ture response vector in STE in order to correspond exactly to transfer entropy (TE), 
PTERV is the direct analogue of partial transfer entropy (PTE), substituting samples 
with ranks. PTERV has a clear advantage over PTE if the latter is estimated by bin- 
ning methods because the domain of joint probabilities for PTERV is generally much 
smaller (for a dimension to the size of the domain for ranks is to! and for the binning 
is b m where b is the number of bins) . Therefore in our simulation study we compared 
PTERV and PSTE with PTE using an advanced estimate of nearest neighbors that 
is found to be much more efficient than any binning estimate. 

We attempted to determine the asymptotic properties of PTERV, and subse- 
quently PSTE, approximating first the bias and variance, and then the distribution 
of PTERV when there is no coupling. We considered approximations with Gaussian 
and Gamma distributions expressing their parameters in terms of the estimated bias 
and variance, and also using known results on Gamma approximation of conditional 
mutual information. We considered these approximations for the null distribution of 
parametric tests for the significance of PTERV (and PSTE). However, for the prob- 
lem of assessing direct coupling, there are often certain dependence structures of the 
driving and the response variable on the confounding variables, which the analytic 
approximations do not encompass and thus the parametric tests are likely to fail. 
We confirmed this with simulations that showed also that the randomization test 
using time-shifted surrogates, though it cannot encompass all types of dependencies, 
is superior to the parametric tests. More work is needed to improve the analytic ap- 
proximation of the distribution of PTERV, and direct causality measures in general, 
so as to make the parametric tests more accurate. 

The correct detection of the true direct causality is a difficult task, and depending 
on the underlying intrinsic dynamics of each subsystem and the coupling structure of 
them, one has to search for the optimal setting of the reconstruction parameters for 
all involved variables (response, driving, conditioning) as well as the horizon of the 
time ahead. The rank measures were found to be particularly sensitive to the choice of 
the reconstruction parameters. The embedding dimension has significant effect on the 
performance of the rank causality measures in the same way that the estimation of the 
probability mass function depend heavily on the dimension of the discrete variable. 
Moreover, while PTERV can estimate better direct causal effects than PSTE for small 
to, as to increases the two measures converge because the difference in the definition 
of the future response vector is small in the context of large embedding vectors. 
The simulations on chaotic coupled systems showed that PTERV performs always 
better than or equally good as PSTE. Compared to PTE using the nearest neighbor 
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estimate, PTERV is more data demanding and thus the power of the randomization 
test using PTERV is smaller. 

PTERV is robust to drifts in the time series and the simulations showed that 
both the power and significance of the randomization test with PTERV remains 
stable when stochastic trend is added in the time series, as well as under different 
levels of detrending. For these data conditions, PTE is found inappropriate and the 
same holds for any causality measure based on samples rather than ranks. 
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